This document is the summary of the R for Data Analysis workshop.
All correspondence related to this document should be addressed to:
Omid Ghasemi (Macquarie University, Sydney, NSW, 2109, AUSTRALIA)
Email: omidreza.ghasemi@hdr.mq.edu.au
Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations
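R can be used as a calculator. Be mindful of operator precedence: R evaluates exponents first, then multiplication and division, then addition and subtraction, so use parentheses to make the intended order explicit.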
10 + 10
## [1] 20
4 ^ 2
## [1] 16
(250 / 500) * 100
## [1] 50
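As a small illustration of the order of operations, the sketch below compares expressions with and without parentheses. The variable names are only for this example.

```r
# Multiplication is evaluated before addition unless parentheses say otherwise.
no_parens   <- 2 + 3 * 4      # 14
with_parens <- (2 + 3) * 4    # 20

# The exponent binds tighter than the leading minus sign.
neg_power <- -2 ^ 2           # -4, read as -(2 ^ 2)
neg_base  <- (-2) ^ 2         # 4

print(c(no_parens, with_parens, neg_power, neg_base))
```

When in doubt, adding parentheses costs nothing and makes the intention clear to the reader.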
R is flexible about spacing between numbers and operators, but the names of variables and functions cannot contain spaces.
10+10
## [1] 20
10 + 10
## [1] 20
R can sometimes tell that you’re not finished yet: if a command is incomplete, the console shows a + prompt and waits for the rest of it.
10 +
How do you create a variable? Assign a value with <- or =. Note that R is case sensitive, so pay and Pay are different names.
pay <- 250
month = 12
pay * month
## [1] 3000
salary <- pay * month
A few points on naming variables and vectors: use short, informative words and keep one consistent style (e.g., capital letters are allowed but not recommended; separate words only with _ or .).
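The naming points can be checked directly in the console. This is a minimal sketch; the names monthly_pay and yearly_pay are made up for the illustration, and the exists("Pay") result assumes a fresh session in which nothing called Pay has been defined.

```r
pay <- 250

# Names are case sensitive: only the lowercase name exists so far.
exists("pay")   # TRUE
exists("Pay")   # FALSE in a fresh session

# make.names() shows how R would repair names that are not valid,
# for example names containing a space or starting with a digit.
make.names(c("monthly pay", "2nd_month"))   # "monthly.pay" "X2nd_month"

# One consistent style, here snake_case (a stylistic choice, not a rule).
monthly_pay <- pay
yearly_pay <- monthly_pay * 12   # 3000
```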
A function is a set of statements combined to perform a specific task. When we use a block of code repeatedly, we can convert it to a function. To write a function, you first need to define it:
my_multiplier <- function(a, b) {
  result <- a * b
  return(result)
}
Defining a function produces no output on its own. To get a result, you need to call it:
my_multiplier (a=2, b=4)
## [1] 8
# or: my_multiplier (2, 4)
We can set a default value for our arguments:
my_multiplier2 <- function(a, b = 4) {
  result <- a * b
  return(result)
}
my_multiplier2(a = 2)
## [1] 8
# or: my_multiplier2(2)
# my_multiplier2(2, 6) overrides the default and returns 12
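To show how positional arguments, named arguments, and defaults interact, here is a short sketch with a hypothetical helper called to_percent (it is not part of the workshop material). It also relies on the fact that a function returns the value of its last expression when return() is omitted.

```r
# Hypothetical helper: what percentage of `total` is `part`?
to_percent <- function(part, total = 100, digits = 1) {
  # The last evaluated expression is returned automatically.
  round(part / total * 100, digits = digits)
}

to_percent(250, 500)             # 50: arguments matched by position
to_percent(45)                   # 45: total falls back to its default of 100
to_percent(total = 3, part = 1)  # 33.3: named arguments can come in any order
to_percent(1, 3, digits = 0)     # 33: positional and named arguments mixed
```

Naming the arguments in the call is usually safer than relying on their position once a function has more than two of them.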
Fortunately, you do not need to write everything from scratch. R has lots of built-in functions that you can use:
round(54.6787)
## [1] 55
round(54.5787, digits = 2)
## [1] 54.58
Use ? before the function name to get some help. For example, ?round. You will see many functions in the rest of the workshop.
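A few related built-in functions are worth comparing with round(). The sketch below uses only base R; note that round() sends a value that is exactly halfway to the nearest even number, which can be surprising at first.

```r
x <- 54.6787

rounded_down <- floor(x)                # 54
rounded_up   <- ceiling(x)              # 55
three_sig    <- signif(x, digits = 3)   # 54.7: three significant digits

# Halfway cases go to the even neighbour.
half_even <- c(round(2.5), round(3.5))  # 2 4

print(c(rounded_down, rounded_up, three_sig))
print(half_even)
```

The help pages ?floor and ?signif describe these functions in the same way as ?round.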
The function class() shows the type of a variable.
TRUE and FALSE can be abbreviated as T and F. They must be written in capital letters; true is not a logical value:
class(TRUE)
## [1] "logical"
class(F)
## [1] "logical"
class(2)
## [1] "numeric"
class(13.46)
## [1] "numeric"
class("ha ha ha ha")
## [1] "character"
class("56.6")
## [1] "character"
class("TRUE")
## [1] "character"
Can we change the type of data in a variable? Yes, by using the as.*() family of functions, such as as.numeric() and as.character():
as.numeric(TRUE)
## [1] 1
as.character(4)
## [1] "4"
as.numeric("4.5")
## [1] 4.5
as.numeric("Hello")
## Warning: NAs introduced by coercion
## [1] NA
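Conversion matters most when data arrive as text. The following sketch uses made-up values (raw_scores is a hypothetical name) to show how failed conversions become NA and how mixing types inside c() silently converts everything to character.

```r
# Made-up text input, as it might come from a spreadsheet.
raw_scores <- c("12", "7.5", "n/a")

# "n/a" cannot be read as a number, so it becomes NA (warning silenced here).
scores <- suppressWarnings(as.numeric(raw_scores))
is.na(scores)            # FALSE FALSE TRUE

# Mixing types in one vector converts every element to character.
mixed <- c(1, "a", TRUE)
class(mixed)             # "character"

# Logical values count as 1 and 0 in arithmetic.
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2
```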
A vector stores more than one number or character string under a single name. Use the combine function c() to create one:
sale <- c(1, 2, 3,4, 5, 6, 7, 8, 9, 10) # also sale <- c(1:10)
sale <- c(1:10)
sale * sale
## [1] 1 4 9 16 25 36 49 64 81 100
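Arithmetic on vectors works element by element, and a shorter vector is reused (recycled) to match a longer one. A brief sketch, with variable names chosen only for this example:

```r
sale <- 1:10

# A single number is applied to every element.
doubled <- sale * 2              # 2 4 6 ... 20

# A shorter vector is recycled: 100, 200, 100, 200, ...
recycled <- sale + c(100, 200)   # 101 202 103 204 105 206 107 208 109 210

# Comparisons are element-wise too and return a logical vector.
above_five <- sale > 5
sum(above_five)                  # 5 elements are greater than 5
```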
Subsetting a vector:
days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
days[2]
## [1] "Sunday"
days[-2]
## [1] "Saturday" "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
days[c(2, 3, 4)]
## [1] "Sunday" "Monday" "Tuesday"
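Besides positions, a vector can be subset with a condition. The sketch below repeats the days vector so that it runs on its own; the result names are made up for the illustration.

```r
days <- c("Saturday", "Sunday", "Monday", "Tuesday",
          "Wednesday", "Thursday", "Friday")

last_day   <- days[length(days)]                    # "Friday"
without_2  <- days[-c(1, 2)]                        # drops the first two days
weekend    <- days[days %in% c("Saturday", "Sunday")]
monday_pos <- which(days == "Monday")               # 3
long_names <- days[nchar(days) > 7]                 # "Saturday" "Wednesday" "Thursday"
```

Subsetting by condition keeps working when the order or length of the vector changes, which is not true of fixed positions.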
Create a vector named my_vector with numbers from 0 to 1000 in it and calculate the mean, median, sd, min, max, and sum of that vector:
my_vector <- (0:1000)
mean(my_vector)
## [1] 500
median(my_vector)
## [1] 500
min(my_vector)
## [1] 0
range(my_vector)
## [1] 0 1000
class(my_vector)
## [1] "integer"
sum(my_vector)
## [1] 500500
sd(my_vector)
## [1] 289.1081
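Two small additions to the summary functions: max() gives the largest value directly, and most of these functions return NA as soon as the data contain a missing value unless na.rm = TRUE is set. The vector with_missing is made up for this sketch.

```r
my_vector <- 0:1000
largest <- max(my_vector)                         # 1000

# Made-up data with one missing value.
with_missing <- c(4, 8, NA, 15)

mean_default <- mean(with_missing)                # NA
mean_clean   <- mean(with_missing, na.rm = TRUE)  # 9
```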
A list allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, or even other lists.
my_list = list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)
my_list
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4 5 6 7
##
## [[5]]
## [1] "HELLO"
##
## [[6]]
## [1] "hello"
##
## [[7]]
## [1] FALSE
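Elements of a list can be given names and then extracted in several ways. In this sketch the element names (numbers, greeting, flag) are assumptions added for illustration; the key point is the difference between single and double brackets.

```r
named_list <- list(numbers = 1:10, greeting = "HELLO", flag = FALSE)

# Double brackets or $ return the element itself.
first_element <- named_list[[1]]        # the integer vector 1:10
greeting      <- named_list$greeting    # "HELLO"

# Single brackets return a shorter list that still wraps the element.
class(named_list[1])      # "list"
class(named_list[[1]])    # "integer"

length(named_list)        # 3
```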
Factors store a vector together with its distinct values, which serve as labels (levels). The labels are always stored as character strings, whether the original values are numeric or character. For example, a variable gender with “male” and “female” entries:
gender <- c("male", "male", "male", " female", "female", "female")
gender <- factor(gender)
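R now treats gender as a nominal (categorical) variable and stores each level internally as an integer, with levels sorted alphabetically. With clean data this would be 1 = female and 2 = male, but the stray leading space in " female" creates a third level: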
summary(gender)
## female female male
## 1 2 3
gender
## [1] male male male female female female
## Levels: female female male
So, be careful of spaces!
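One way to guard against stray spaces is to trim the text before converting it to a factor. This is a minimal sketch using the base R function trimws(); the names gender_raw and gender_clean are only for the example.

```r
# Same entries as above, including the accidental leading space.
gender_raw <- c("male", "male", "male", " female", "female", "female")

# trimws() removes leading and trailing whitespace from each string.
gender_clean <- factor(trimws(gender_raw))

levels(gender_clean)   # "female" "male"
table(gender_clean)    # 3 female, 3 male
```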
Create a gender factor with 30 males and 40 females (hint: use the rep() function):
gender <- c(rep("male",30), rep("female", 40))
gender <- factor(gender)
gender
## [1] male male male male male male male male male male
## [11] male male male male male male male male male male
## [21] male male male male male male male male male male
## [31] female female female female female female female female female female
## [41] female female female female female female female female female female
## [51] female female female female female female female female female female
## [61] female female female female female female female female female female
## Levels: female male
There are two types of categorical variables: nominal and ordinal. How do we create ordered factors (when the variable is ordinal, so its values can be ranked)? We add the argument ordered = TRUE to the factor() function and, where needed, set the level names with levels = c("level1", "level2") or afterwards with levels(). For example, we have a vector that shows participants’ education level:
edu<-c(3,2,3,4,1,2,2,3,4)
education<-factor(edu, ordered = TRUE)
levels(education) <- c("Primary school","high school","College","Uni graduated")
education
## [1] College high school College Uni graduated Primary school
## [6] high school high school College Uni graduated
## Levels: Primary school < high school < College < Uni graduated
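The same ordered factor can be built in a single call by passing levels and labels together, and once the order is defined the levels can be compared. A sketch under the assumption that the codes 1 to 4 map to the four labels in the order used above:

```r
edu <- c(3, 2, 3, 4, 1, 2, 2, 3, 4)

education <- factor(edu,
                    levels = 1:4,
                    labels = c("Primary school", "high school",
                               "College", "Uni graduated"),
                    ordered = TRUE)

# Comparisons respect the level order.
college_or_more <- education >= "College"
sum(college_or_more)          # 5 participants

highest <- max(education)     # Uni graduated
```

Comparisons such as >= only make sense for ordered factors; for a nominal factor R returns NA with a warning.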
Suppose a factor holds patient and control values. Here, the first level is control and the second level is patient. Change the order of levels, so patient would be the first level:
health_status <- factor(c(rep('patient',5),rep('control',5)))
health_status
## [1] patient patient patient patient patient control control control control
## [10] control
## Levels: control patient
health_status_reordered <- factor(health_status, levels = c('patient','control'))
health_status_reordered
## [1] patient patient patient patient patient control control control control
## [10] control
## Levels: patient control
Finally, can you relabel both levels so that they start with a capital letter? (Hint: check ?factor)
health_status_relabeled <- factor(health_status, levels = c('patient','control'), labels = c('Patient','Control'))
health_status_relabeled
## [1] Patient Patient Patient Patient Patient Control Control Control Control
## [10] Control
## Levels: Patient Control
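Two alternative routes to similar results, offered as a sketch rather than as the workshop's own solution: relevel() moves one level to the front, and assigning to levels() renames the existing levels, here with toupper() for fully uppercase labels.

```r
health_status <- factor(c(rep("patient", 5), rep("control", 5)))

# relevel() from the stats package makes the chosen level the first one.
health_status_ref <- stats::relevel(health_status, ref = "patient")
levels(health_status_ref)      # "patient" "control"

# Renaming the levels in place; the underlying data are unchanged.
health_status_upper <- health_status_ref
levels(health_status_upper) <- toupper(levels(health_status_upper))
levels(health_status_upper)    # "PATIENT" "CONTROL"
```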
All elements of a matrix must have the same mode (numeric, character, etc.) and all columns must have the same length. A matrix can be created by passing a vector to the matrix() function, which fills it column by column by default:
my_matrix = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, ncol = 3)
my_matrix
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
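The fill direction and the indexing of a matrix can be explored with a short sketch. The argument byrow = TRUE fills row by row instead; the names by_col and by_row are made up for the example.

```r
by_col <- matrix(1:9, nrow = 3, ncol = 3)                 # default: column by column
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)   # row by row

# Indexing is [row, column].
by_col[2, 3]    # 8
by_row[2, 3]    # 6
by_row[, 1]     # first column: 1 4 7

dim(by_col)     # 3 3

# Transposing one layout gives the other.
identical(t(by_col), by_row)   # TRUE
```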
Data frames can hold numeric, character, or logical values. Within a column all elements have the same data type, but different columns can have different data types. Let’s create a data frame:
id <- 1:200
group <- c(rep("Psychotherapy", 100), rep("Medication", 100))
response <- c(rnorm(100, mean = 30, sd = 5),
rnorm(100, mean = 25, sd = 5))
my_dataframe <-data.frame(Patient = id,
Treatment = group,
Response = response)
We could also have built it in a single call:
my_dataframe <-data.frame(Patient = c(1:200),
Treatment = c(rep("Psychotherapy", 100), rep("Medication", 100)),
Response = c(rnorm(100, mean = 30, sd = 5),
rnorm(100, mean = 25, sd = 5)))
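Because rnorm() draws random numbers, the Response values differ on every run. If reproducible values are wanted, one option is to call set.seed() first. In this sketch the seed 2109 is an arbitrary choice, and str() is used to inspect the structure of the result.

```r
# Setting the same seed before each draw gives the same numbers.
set.seed(2109)
draw_a <- rnorm(5, mean = 30, sd = 5)
set.seed(2109)
draw_b <- rnorm(5, mean = 30, sd = 5)
identical(draw_a, draw_b)    # TRUE

set.seed(2109)
my_dataframe <- data.frame(Patient = 1:200,
                           Treatment = rep(c("Psychotherapy", "Medication"), each = 100),
                           Response = c(rnorm(100, mean = 30, sd = 5),
                                        rnorm(100, mean = 25, sd = 5)))

str(my_dataframe)    # column names, types, and the first few values
dim(my_dataframe)    # 200 rows, 3 columns
```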
In large data sets, the function head() shows the first observations of a data frame. Similarly, the function tail() prints the last observations in your data set.
head(my_dataframe)
| | Patient | Treatment | Response |
|---|---|---|---|
| 1 | 1 | Psychotherapy | 25.67751 |
| 2 | 2 | Psychotherapy | 30.50688 |
| 3 | 3 | Psychotherapy | 27.85405 |
| 4 | 4 | Psychotherapy | 31.18748 |
| 5 | 5 | Psychotherapy | 26.43779 |
| 6 | 6 | Psychotherapy | 29.94332 |
tail(my_dataframe)
| | Patient | Treatment | Response |
|---|---|---|---|
| 195 | 195 | Medication | 37.82939 |
| 196 | 196 | Medication | 29.60533 |
| 197 | 197 | Medication | 26.16398 |
| 198 | 198 | Medication | 23.64795 |
| 199 | 199 | Medication | 31.70907 |
| 200 | 200 | Medication | 35.42236 |
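Both functions show six rows by default, and the argument n changes that. A minimal sketch with a small made-up data frame (small_df and its Score column are hypothetical):

```r
small_df <- data.frame(Patient = 1:10,
                       Score = seq(10, 100, by = 10))

first_three <- head(small_df, n = 3)   # rows 1 to 3
last_two    <- tail(small_df, n = 2)   # rows 9 and 10

nrow(small_df)    # 10
names(small_df)   # "Patient" "Score"
```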
Similar to vectors and matrices, brackets [] are used to select data from rows and columns in data frames:
my_dataframe[35, 3]
## [1] 29.57839
my_dataframe[1:10, ]
| Patient | Treatment | Response |
|---|---|---|
| 1 | Psychotherapy | 25.67751 |
| 2 | Psychotherapy | 30.50688 |
| 3 | Psychotherapy | 27.85405 |
| 4 | Psychotherapy | 31.18748 |
| 5 | Psychotherapy | 26.43779 |
| 6 | Psychotherapy | 29.94332 |
| 7 | Psychotherapy | 31.88499 |
| 8 | Psychotherapy | 24.30275 |
| 9 | Psychotherapy | 29.62321 |
| 10 | Psychotherapy | 25.47421 |
How to get only the Response column for all participants?
my_dataframe[ , 3]
## [1] 25.67751 30.50688 27.85405 31.18748 26.43779 29.94332 31.88499 24.30275
## [9] 29.62321 25.47421 28.67835 44.49822 38.95770 26.71795 32.86743 42.92951
## [17] 39.43184 25.28679 33.79008 26.63997 30.29671 29.71972 37.59836 25.21335
## [25] 23.60956 25.40918 32.97853 31.19940 33.52826 23.30590 34.06822 28.86365
## [33] 28.68556 24.65501 29.57839 33.65160 34.34322 34.88337 33.18487 26.96075
## [41] 24.85969 30.39736 28.99848 35.35618 33.59227 24.40365 33.69494 39.44119
## [49] 25.38689 17.62718 29.38772 36.28467 32.72525 32.65730 33.28753 20.59795
## [57] 29.39656 30.47841 29.72631 30.00521 28.40262 45.39311 33.73790 26.12625
## [65] 23.75714 36.57074 30.14156 24.95673 27.88779 25.00442 25.27760 25.44098
## [73] 26.21632 28.23137 38.68072 36.40383 38.53229 30.75424 25.71612 29.67645
## [81] 23.35774 41.43341 35.42844 27.04721 36.19928 33.90553 31.23478 31.37820
## [89] 23.47448 35.14562 24.35033 26.32279 28.11685 25.02537 40.15536 28.78169
## [97] 24.27265 34.10561 32.77914 28.53924 24.44748 25.91238 25.41619 32.77701
## [105] 26.24443 20.20428 14.95122 33.18519 22.92441 23.41872 28.23043 26.70726
## [113] 19.49191 21.57979 21.17556 15.86531 20.19090 28.84964 24.95565 17.15686
## [121] 36.20514 21.61509 29.94009 24.55726 29.75400 29.53863 28.62965 24.30133
## [129] 23.59685 29.22331 20.96769 27.23252 26.04409 28.03235 26.88758 31.41231
## [137] 16.46511 28.84658 22.50181 25.28338 21.68744 21.42948 21.21029 26.07754
## [145] 24.13913 33.76351 24.47093 32.22514 36.42094 23.74251 27.21891 35.39804
## [153] 26.97973 25.25728 14.17212 24.51770 28.84445 19.45879 29.29256 26.29966
## [161] 23.45768 32.36189 29.09744 22.18670 25.16793 16.59150 25.17464 22.56768
## [169] 20.15450 31.92731 24.13743 25.32165 24.07822 24.98594 21.81218 29.17458
## [177] 18.59470 34.26587 31.61718 22.37180 37.62753 21.69863 32.74779 32.97598
## [185] 34.63547 21.96093 24.07853 35.22956 22.56170 23.40207 31.76184 27.97739
## [193] 25.03920 19.07650 37.82939 29.60533 26.16398 23.64795 31.70907 35.42236
Another, easier way to select particular items is to use their names, which is more helpful than row or column numbers in large data sets:
my_dataframe[ , "Response"]
# OR:
my_dataframe$Response
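Column names combine well with conditions for selecting rows. The sketch below uses a small data frame with made-up Response values so that the results are easy to verify by eye; tapply() is a base R function that applies a function within each group.

```r
# Small illustrative data frame; the values are invented for this example.
my_dataframe <- data.frame(Patient = 1:6,
                           Treatment = rep(c("Psychotherapy", "Medication"), each = 3),
                           Response = c(30, 32, 28, 25, 27, 23))

# Rows where a condition holds, all columns.
medication <- my_dataframe[my_dataframe$Treatment == "Medication", ]

# Rows where a condition holds, selected columns by name.
high <- my_dataframe[my_dataframe$Response > 27, c("Patient", "Response")]

# Mean response within each treatment group.
group_means <- tapply(my_dataframe$Response, my_dataframe$Treatment, mean)
group_means   # Medication 25, Psychotherapy 30
```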
So far, we have created data frames using the data.frame() function from base R. However, a better way to create data frames is to use the tibble() function from the tidyverse.
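As a hedged sketch of the tibble approach, the code below builds a small table with tibble::tibble(). It assumes the tibble package (part of the tidyverse) is installed and is skipped otherwise; the Response values are invented for the example.

```r
# Run only if the tibble package is available (an assumption about the setup).
if (requireNamespace("tibble", quietly = TRUE)) {
  my_tibble <- tibble::tibble(
    Patient = 1:6,
    Treatment = rep(c("Psychotherapy", "Medication"), each = 3),
    Response = c(30, 32, 28, 25, 27, 23)
  )

  # A tibble prints its dimensions and column types along with the data.
  print(my_tibble)

  # It is still a data frame, so the selection tools shown earlier apply.
  is.data.frame(my_tibble)   # TRUE
  my_tibble$Response
}
```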